Lecture: Data Manipulation and Transformation

Actuarial Data Science - Open Learning Resource

Fei Huang, UNSW Sydney

Introduction

Motivation

  • It is rare that you get the data in exactly the form you need.
  • You may need to create new variables or summaries, or
  • Simply rename variables or reorder observations to make the data easier to work with.

In practice, a large part of an actuary’s work involves cleaning and reshaping data to ensure that it is trustworthy and easy to analyse.

In this lecture, we introduce a small set of powerful verbs (filter(), select(), mutate(), summarise(), group_by(), and joins) that can be combined to express many common data-wrangling tasks clearly.

Using R to Manipulate Data

  • R package: dplyr (a core member of tidyverse) for data manipulation and transformation
    • Note: dplyr overwrites some functions in base R (e.g. filter(), lag()). To use the base versions (after loading dplyr), specify stats::filter() and stats::lag().
  • Data: thenycflights13 package, which contains data on flights departing from New York City in 2013
  • We use ggplot2 to help us explore and understand the data.
#install.packages("nycflights13")
library(nycflights13)
library(tidyverse)

Data

flights #tibble, tweaked data frame to work better in tidyverse
view(flights) # will open the dataset in the RStudio viewer

See also: Filter Data

Types of variables

  • int: integers
  • dbl: doubles (real numbers)
  • chr: character vectors (strings)
  • dttm: date-times (a date + time)
  • lgl: logical vectors (TRUE or FALSE)
  • fctr: factors (categorical variables with fixed levels)
  • date: dates.

These are the common variable types used in tidyverse data frames (tibble)

Data Manipulation Functions

Functions for Data Manipulation

Functions in the dplyr package:

  • %>%: pipe operator, used to chain multiple operations together
  • glimpse(): a glimpse into the data and its structure
  • filter(): select observations (rows) that satisfy given conditions
  • arrange(): reorder rows based on variable values
  • select(): choose a subset of variables (columns)
  • mutate(): create new variables or transform existing ones
  • summarise(): reduce multiple values to a single summary
  • group_by(): group data so operations are performed within each group

Filter: Introduction

#The first argument is the name of the data frame. 
#The subsequent arguments are the expressions that filter the data frame.
filter(flights, month == 1, day == 1)
# use the assignment operator, <- to save the result
#jan1 <- filter(flights, month == 1, day == 1)
# Save and print the result at the same time
#(dec25 <- filter(flights, month == 12, day == 25))

See also: Full Data

Filter: Comparisons

To use filter() effectively, you need to know how to select observations using comparison operators. R provides the standard set:

  • > greater than
  • >= greater than or equal to
  • < less than
  • <= less than or equal to
  • != not equal to
  • == equal to

Note: be cautious when using == with floating point numbers. Consider using near() instead.

Filter: Logical Operators

  • Multiple arguments in filter() are combined with “and”: every condition must be true for a row to be included in the output.

Other logical operators:

  • &: “and”
  • |: “or”
  • !: “not”
  • x %in% y select rows where x is one of the values in y

According to De Morgan’s laws:

  • !(x & y) is equivalent to !x | !y
  • !(x | y) is equivalent to !x & !y

Exercise: Filtering Multiple Months

Question:

Find all flights that departed in November or December.

filter(flights, month %in% c(11, 12))
#Alternatively, use the following
#filter(flights, month == 11 | month == 12)

Exercise: Combining Logical Conditions

Question:

Find flights that were not delayed (on arrival or departure) by more than two hours.

filter(flights, !(arr_delay > 120 | dep_delay > 120))
#Alternatively, use the following
#filter(flights, arr_delay <= 120, dep_delay <= 120)

Filter: Missing Values

  • NA represents an unknown value, so missing values are “contagious”: almost any operation involving an unknown value will also return NA.
  • filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values. If you want to preserve missing values, you need to include them explicitly:
df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)
filter(df, is.na(x) | x > 1)

Arrange

  • arrange(): order rows by column names (or more complex expressions)
  • Use desc() to sort a column in descending order
  • Missing values are always placed at the end
arrange(flights, year)
#arrange(flights, desc(dep_delay))

Select

  • select(): select a subset of variables (columns) by name
select(flights, year, month, day)
#select(flights, year:day)
#select(flights, -(year:day))

Select: Useful Functions

  • starts_with("abc"): select variables whose names begin with “abc”
  • ends_with("xyz"): select variables whose names end with “xyz”
  • contains("ijk"): select variables whose names contain “ijk”.
  • matches(): select variables that match a regular expression.
  • everything(): select all variables

Other related functions:

  • rename(): rename variables
  • mutate(): create new variables from existing variables

Summarise

summarise(flights, delay = mean(dep_delay, na.rm = TRUE))

Summarise with Group-by

  • This changes the unit of analysis from the entire dataset to individual groups.
by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))

Exercise: Summarising Flight Delays by Destination

  • Explore the relationship between distance and average delay for each destination
  • There are three steps to prepare this data:
    • Group flights by destination
    • Summarise to compute distance, average delay, and number of flights
    • Filter to remove noisy points (counts less than or equal to 20) and Honolulu (“HNL”) airport, which is much farther away than other destinations

Please have a try!

Solution: Data Manipulation

# Step 1: Group flights by destination
by_dest <- group_by(flights, dest)
# Step 2: Summarise to compute statistics for each destination
delay_summary <- summarise(by_dest,
  count = n(),                                    # Number of flights
  dist = mean(distance, na.rm = TRUE),           # Average distance
  delay = mean(arr_delay, na.rm = TRUE)          # Average arrival delay
)
# Step 3: Filter to remove noisy points and outliers
delay_summary <- filter(delay_summary, 
                       count > 20,                # Keep destinations with >20 flights
                       dest != "HNL")            # Remove Honolulu (outlier: very far)

# Create visualization: relationship between distance and delay
# It looks like delays increase with distance up to ~750 miles 
# and then decrease. Maybe as flights get longer there's more 
# ability to make up delays in the air?
delay_plot <- ggplot(data = delay_summary, 
                     mapping = aes(x = dist, y = delay)) +
  geom_point(aes(size = count), alpha = 1/3) +    # Point size represents flight count
  geom_smooth(se = FALSE)                         # Add smooth trend line
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Solution: Visualisation

print(delay_plot)

Alternative Solution Using the Pipe Operator %>%

# Same analysis using pipe operator for cleaner, more readable code
delays_summary <- flights %>% 
  group_by(dest) %>%                              # Group by destination
  summarise(
    count = n(),                                   # Number of flights per destination
    dist = mean(distance, na.rm = TRUE),          # Average distance
    delay = mean(arr_delay, na.rm = TRUE)          # Average arrival delay
  ) %>% 
  filter(count > 20, dest != "HNL")              # Filter: >20 flights, exclude Honolulu

Missing values

  • na.rm=TRUE: removes missing values
flights %>% 
  group_by(year, month, day) %>% 
  summarise(mean = mean(dep_delay, na.rm = TRUE))

Useful Summary Functions

  • Measures of location (central tendency): mean(x), median(x)
  • Measures of spread (variability): sd(x), IQR(x)
  • Measures of rank: min(x), quantile(x, 0.25), max(x)
  • Measures of position: first(x), nth(x, 2), last(x)
  • Counts: n(), sum(!is.na(x)), n_distinct(x)
    • count(tailnum, wt = distance): “count” (sum) the total number of miles a plane flew
  • Counts and proportions of logical values: sum(x > 10), mean(y == 0)

Note:

The interquartile range (IQR) is a measure of variability based on dividing a data set into quartiles. Q1 is the middle value in the lower half of the rank-ordered data. Q2 is the median value in the dataset. Q3 is the middle value in the upper half of the rank-ordered data. The interquartile range is equal to Q3 − Q1.

Grouping by Multiple Variables

# Group flights by date (year, month, day)
daily <- group_by(flights, year, month, day)
# Summarise: count flights per day
# n() returns the size of the current group
(per_day <- summarise(daily, flights = n()))

Ungrouping

# Remove grouping to get overall summary
daily %>% 
  ungroup() %>%                    # Remove grouping structure
  summarise(flights = n())         # Now counts all flights (not per day)

Relational Data

Relational Data

  • Relational data: data stored across multiple tables that are related to each other

Three families of verbs designed to work with relational data:

  • Mutating joins: add new variables to one data frame from matching observations in another

  • Filtering joins: filter observations from one data frame based on whether they match observations in another table

  • Set operations: treat observations as if they were set elements.

  • A similar database system: SQL

Dataset

  • nycflights13 contains five related tibbles:
    • flights: information about each flight
    • airlines: full carrier names
    • airports: information about each airport, identified by the faa airport code
    • planes: information about each plane, identified by its tailnum
    • weather: weather data at each NYC airport for each hour

Table Relationships

  • Each relationship involves a pair of tables
  • It is important to understand how tables are connected when working with relational data

Relations of Tibbles (Source: Wickham and Grolemund (2017), R for Data Science)

Keys

  • Primary key: a variable (or set of variables) that uniquely identifies each observation in its own table.
    • For example, planes$tailnum is a primary key because it uniquely identifies each plane in the planes table.
  • Foreign key: a variable (or set of variables) in one table that refers to a primary key in another table.
    • For example, flights$tailnum is a foreign key because it links each flight to a plane in the planes table.
  • A variable can be both a primary key and a foreign key. - For example, origin is part of the primary key in the weather table, and is also a foreign key that links to the airports table.

Identify the primary keys

  • Use count() the primary keys and look for entries where n is greater than 1.
# Check if tailnum is a valid primary key (should have no duplicates)
# If this returns empty, tailnum is a valid primary key
planes %>% 
  count(tailnum) %>%           # Count occurrences of each tailnum
  filter(n > 1)                 # Find any duplicates (n > 1 means not unique)

Add a Primary Key

  • What is the primary key in the flights table?
    • None
  • Surrogate key: add a primary key using mutate() and row_number()
  • A primary key and the corresponding foreign key in another table form a relationship.
  • Relationship are typically one-to-many. Many-to-many relationships can be represented using a many-to-one relationship plus a one-to-many relationship.

Exercise: Add a Surrogate Key to flights

flights %>% 
  arrange(year, month, day, sched_dep_time, carrier, flight) %>%
  mutate(flight_id = row_number()) %>%
  #This makes it possible to see every column in a data frame.
  glimpse()
Rows: 336,776
Columns: 20
$ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2~
$ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
$ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
$ dep_time       <int> 517, 533, 542, 544, 554, 559, 558, 559, 558, 558, 557, ~
$ sched_dep_time <int> 515, 529, 540, 545, 558, 559, 600, 600, 600, 600, 600, ~
$ dep_delay      <dbl> 2, 4, 2, -1, -4, 0, -2, -1, -2, -2, -3, NA, 1, 0, -5, -~
$ arr_time       <int> 830, 850, 923, 1004, 740, 702, 753, 941, 849, 853, 838,~
$ sched_arr_time <int> 819, 830, 850, 1022, 728, 706, 745, 910, 851, 856, 846,~
$ arr_delay      <dbl> 11, 20, 33, -18, 12, -4, 8, 31, -2, -3, -8, NA, -6, -7,~
$ carrier        <chr> "UA", "UA", "AA", "B6", "UA", "B6", "AA", "AA", "B6", "~
$ flight         <int> 1545, 1714, 1141, 725, 1696, 1806, 301, 707, 49, 71, 79~
$ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N39463", "N708~
$ origin         <chr> "EWR", "LGA", "JFK", "JFK", "EWR", "JFK", "LGA", "LGA",~
$ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ORD", "BOS", "ORD", "DFW",~
$ air_time       <dbl> 227, 227, 160, 183, 150, 44, 138, 257, 149, 158, 140, N~
$ distance       <dbl> 1400, 1416, 1089, 1576, 719, 187, 733, 1389, 1028, 1005~
$ hour           <dbl> 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6~
$ minute         <dbl> 15, 29, 40, 45, 58, 59, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0~
$ flight_id      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ~

Mutating Joins

  • Mutating join combine variables from two tables. They match observations by their keys, then add variable from one table to the other.
  • Add columns from y to x:
    • inner_join(): keep observations that appear in both tables
    • left_join(): keep all observations in x
    • right_join(): keep all observations in y
    • full_join(): keep all observations in both x and y

# Join flights with airline names
# Select relevant columns from flights, then join with airlines table
flights %>% 
  select(year:day, hour, tailnum, carrier) %>% 
  left_join(airlines, by = "carrier")  # Join by carrier code (key)

Mutating Joins: Key Columns

  • If keys are duplicated, the join returns all possible combinations of matching rows.
  • Defining key columns
    • by=NULL: uses all variables that appear in both tables (a natural join)
    • by = "x": uses a common variable named “x”
    • by = c("a" = "b"): match variable a in table x to variable b in table y

Mutating Joins: base::merge()

dplyr merge
inner_join(x, y) merge(x, y)
left_join(x, y) merge(x, y, all.x = TRUE)
right_join(x, y) merge(x, y, all.y = TRUE)
full_join(x, y) merge(x, y, all.x = TRUE, all.y = TRUE)
  • dplyr joins are generally faster and preserve the order of rows.

Filtering Joins

  • Filtering joins match observations in the same way as mutating joins, but they affect rows (observations), not columns (variables).
  • There are two types:
    • semi_join(x, y): keep all observations in x that have a match in y
    • anti_join(x, y): drop all observations in x that have a match in y

Exercise: Using semi_join()

Question:

  • Find the ten most popular destinations
  • Match them back to flights

# Step 1: Find top 10 most popular destinations
top_dest <- flights %>%
  count(dest, sort = TRUE) %>%    # Count flights per destination, sort descending
  head(10)                         # Keep only top 10

# Step 2: Filter flights to only those going to top destinations
flights %>%
  semi_join(top_dest)              # Keep only flights matching top destinations

Exercise: Using anti_join()

Question:

When connecting flights and planes, which flights do not have a match in in planes?

# Find flights with tailnums that don't exist in planes table
# These are flights with missing plane information
flights %>%
  anti_join(planes, by = "tailnum") %>%  # Keep flights NOT in planes table
  count(tailnum, sort = TRUE)             # Count occurrences of each unmatched tailnum

Join Problems

  • Start by identifying the variables that form the primary key in each table.
  • Check that none of the variables in the primary key are missing. If a value is missing, it cannot identify an observation.
  • Check that your foreign keys match primary keys in another table. The best way to do this is using anti_join().

Set Operations

  • intersect(x, y): return observations that appear in both x and y
  • union(x, y): return all unique observations in x and y
  • setdiff(x, y): return observations in x but not in y

Examples of Set Operations

# Create two sample data frames
df1 <- tribble(
  ~x, ~y,
   1,  1,
   2,  1
)
df2 <- tribble(
  ~x, ~y,
   1,  1,
   1,  2
)
# Set operations
intersect(df1, df2)    # Rows in both df1 and df2
union(df1, df2)        # All unique rows from df1 and df2
setdiff(df1, df2)      # Rows in df1 but not in df2
# setdiff(df2, df1)    # Rows in df2 but not in df1

References

Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 2nd ed. O’Reilly Media.
Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. Sebastopol, CA: O’Reilly Media.